flowchart LR D[Data exported\nfrom REDCAP]-->C[Data cleaning] C-->S[Statistical analysis] S-->P[Presentation\nof results]
Jeremy Lew
May 27, 2024
flowchart LR D[Data exported\nfrom REDCAP]-->C[Data cleaning] C-->S[Statistical analysis] S-->P[Presentation\nof results]
install.packages("palmerpenguins") in the consolelibrary(palmerpenguins)remove.packages("palmerpenguins") in the consoleConsole: where you run/execute lines of code
To assign a variable, we use the <- operator
Environment: where you’re able to see variables stored in random-access memory
rm(list = ls()) in consolepalmerpenguins and tidyverse packages. Install them if you have not already done sopenguins dataset from palmerpenguins packagebody_mass_g (continuous) into a categorical variable with 2 categories (“light”: <= 4750g and “heavy”: > 4750g)body_mass_g as the dependent variable and independent variables: species, bill_length_mm, bill_depth_mm, flipper_length_mm and sexFactor datatype for working with categorical variables
A factor variable is more useful than a plain character vector. E.g. the first level will be used by glm as the reference category
Data structures are containers for holding data
A character vector holds multiple characters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
A numeric vector holds multiple numbers
Exercise
length(letters)letters[1:5]letters[c(1, 9, 20)]letters[length(letters):1]Unnamed lists
Named lists
Exercise
mylist <- list(a = "a", b = 1, c = 10:15)length(mylist)mylist[1]mylist$a or mylist[["a"]]mylist[3] and mylist[[3]]class(mylist[3]) and class(mylist[[3]])Dataframe is a tabular container that can hold multiple data types like lists, but each column can only store data of the same data type
To view the first 2 rows of the penguins dataset from the palmerpenguins package
# A tibble: 2 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
# ℹ 2 more variables: sex <fct>, year <int>
To view the last 2 rows of the dataset
# A tibble: 2 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Chinstrap Dream 50.8 19 210 4100
2 Chinstrap Dream 50.2 18.7 198 3775
# ℹ 2 more variables: sex <fct>, year <int>
To view the data in rstudio, execute view(penguins)
Exercise
penguins dataset using dim(penguins), ncol(penguins), nrow(penguins)?glimpse(penguins)?summary(penguins)?To get the column/variable-names of your dataset
To access any column variable, we use the $ syntax
To get a table count of a variable or a cross-table count of 2 variables
See Data Cleaning for more ways to work with dataframes
read_csvread_excelhaven packagelabelled packageA good chapter on explaining functions is Chapter 6, Advanced R, where most of the examples in the slides are taken
Example function
Components of a function
Functions are objects that can be assigned or they can be anonymous functions
How do we invoke the function?
How do we invoke one function after another?
Suppose we have another function f02
To call f01 first followed by f02 and sqrt we will “nest” the functions (inside-out, right to left)
Another way to invoke a series of functions is to use the pipe operator %>% provided by the magrittr package
Exercise
f02 and f02(1, 2)?What is returned by a function? The last evaluated expression
How do we store values outputted from a function? With assignment statment
Early termination of a function with a return statement
Pass by copy vs pass by reference
In R, all functions are pass by reference if you don’t modify the object subsequently within the function. i.e. copy-on-modify
How to pass arguments?
By position e.g. f04(10, 50, 20)
By name e.g. f04(z = 20, y = 50, x = 10)
Unpacking multiple named arguments using do.call
Names defined inside a function mask names defined outside a function
R will look up a variable in the enclosing scope (e.g. within the function) and if it’s not found, will continue to proceed upwards (e.g. enclosing function or global environment) to look for the variable until it is found
R looks for the values when the function is run, not when the function is created
The select function enables you to select a subset of the columns in your dataset
The filter function enables you to select a subset of the rows that meet a certain criteria
# A tibble: 3 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
# A tibble: 165 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.5 17.4 186 3800
2 Adelie Torgersen 40.3 18 195 3250
3 Adelie Torgersen 36.7 19.3 193 3450
4 Adelie Torgersen 38.9 17.8 181 3625
5 Adelie Torgersen 41.1 17.6 182 3200
6 Adelie Torgersen 36.6 17.8 185 3700
7 Adelie Torgersen 38.7 19 195 3450
8 Adelie Torgersen 34.4 18.4 184 3325
9 Adelie Biscoe 37.8 18.3 174 3400
10 Adelie Biscoe 35.9 19.2 189 3800
# ℹ 155 more rows
# ℹ 2 more variables: sex <fct>, year <int>
The mutate function enables you to add columns to your dataset. The added columns can be derived from existing column(s)
# A tibble: 5 × 2
body_mass_g body_mass_100g
<int> <dbl>
1 3750 37.5
2 3800 38
3 3250 32.5
4 NA NA
5 3450 34.5
The summarise function enables you to get summary statistics like N, mean, median etc
# A tibble: 1 × 2
N mean_body_mass_g
<int> <dbl>
1 344 NA
The group_by function enables you to get summary statistics within groups
# A tibble: 3 × 4
sex N mean_body_mass_g median_body_mass_g
<fct> <int> <dbl> <dbl>
1 female 165 3862. 3650
2 male 168 4546. 4300
3 <NA> 11 NA NA
The function arrange enables us to sort by a certain variable
# A tibble: 9 × 3
# Groups: sex [3]
sex year mean_body_mass_g
<fct> <int> <dbl>
1 female 2009 3874.
2 male 2009 4521.
3 <NA> 2009 NA
4 female 2008 3888.
5 male 2008 4632.
6 <NA> 2008 4650
7 female 2007 3821.
8 male 2007 4479.
9 <NA> 2007 NA
The functions if_else and case_when are often used with mutate to create new variables
# A tibble: 4 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Chinstrap Dream 43.5 18.1 202 3400
2 Gentoo Biscoe 43.6 13.9 217 4900
3 Gentoo Biscoe 48.1 15.1 209 5500
4 Gentoo Biscoe 46.8 14.3 215 4850
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_type <chr>
The ggplot2 package is well known for plotting figures
| Variable 1 | Variable 2 | Bivariate test |
|---|---|---|
| Categorical | Categorical | 1. Chi-square test 2. Fisher’s exact test |
| Categorical | Continuous | Parametric: 1. Independent t-test (2 categories) 2. One-way independent ANOVA (>2 categories) |
| Categorical | Continuous | Non-parametric: 1. Mann-Whitney test (2 categories) a.k.a. Wilcoxon rank-sum test 2. Kruskal-Wallis test (>2 categories) |
| Continuous | Continuous | Parametric: Pearson’s correlation coefficient |
| Continuous | Continuous | Non-parametric: Spearman’s correlation coefficient |
| Variable 1 | Variable 2 | Bivariate test |
|---|---|---|
| Categorical | Categorical time (e.g. baseline, 12-month) |
McNemar’s test |
| Continuous | Categorical time (e.g. baseline, 12-month) |
Parametric: Dependent t-test |
| Continuous | Categorical time (e.g. baseline, 12-month) |
Non-parametric: Wilcoxon signed-rank test |
tableby function of the arsenal package lets you do this easilylibrary(arsenal)
1tableby(species ~ sex + island + bill_length_mm,
data = penguins,
control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"))) %>%
summary(text = TRUE) %>%
knitr::kable(align = c("l", rep("r", 5)))tableby function invoked by these 3 lines
| Adelie (N=152) | Chinstrap (N=68) | Gentoo (N=124) | Total (N=344) | p value | |
|---|---|---|---|---|---|
| sex | 0.976 | ||||
| - N-Miss | 6 | 0 | 5 | 11 | |
| - female | 73 (50.0%) | 34 (50.0%) | 58 (48.7%) | 165 (49.5%) | |
| - male | 73 (50.0%) | 34 (50.0%) | 61 (51.3%) | 168 (50.5%) | |
| island | < 0.001 | ||||
| - Biscoe | 44 (28.9%) | 0 (0.0%) | 124 (100.0%) | 168 (48.8%) | |
| - Dream | 56 (36.8%) | 68 (100.0%) | 0 (0.0%) | 124 (36.0%) | |
| - Torgersen | 52 (34.2%) | 0 (0.0%) | 0 (0.0%) | 52 (15.1%) | |
| bill_length_mm | < 0.001 | ||||
| - N-Miss | 1 | 0 | 1 | 2 | |
| - Mean (SD) | 38.791 (2.663) | 48.834 (3.339) | 47.505 (3.082) | 43.922 (5.460) | |
| - Median (Q1, Q3) | 38.800 (36.750, 40.750) | 49.550 (46.350, 51.075) | 47.300 (45.300, 49.550) | 44.450 (39.225, 48.500) | |
| - Range | 32.100 - 46.000 | 40.900 - 58.000 | 40.900 - 59.600 | 32.100 - 59.600 |
paired function from arsenal package# A tibble: 5 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Dream 35.6 17.5 191 3175
2 Chinstrap Dream 50.7 19.7 203 4050
3 Adelie Torgersen 39.7 18.4 190 3900
4 Gentoo Biscoe 43.2 14.5 208 4450
5 Gentoo Biscoe 47.2 13.7 214 4925
# ℹ 2 more variables: sex <fct>, year <int>
flowchart LR A[Model specification] --> B[Inference] --> C[Diagnostics]
GLM consists of 2 components. See notations below1
Component 1: What distribution does your outcome, \(y_i\) take on?
Component 2: What is the link function \(g\) you want to use?
~ is the dependent variable, while RHS are the independent variablesfamily argument contains the 2 components of the GLM we discussed earlier
Call:
glm(formula = body_mass_g ~ species + bill_length_mm + bill_depth_mm +
flipper_length_mm + sex, family = gaussian(link = "identity"),
data = penguins)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1460.995 571.308 -2.557 0.011002 *
speciesChinstrap -251.477 81.079 -3.102 0.002093 **
speciesGentoo 1014.627 129.561 7.831 6.85e-14 ***
bill_length_mm 18.204 7.106 2.562 0.010864 *
bill_depth_mm 67.218 19.742 3.405 0.000745 ***
flipper_length_mm 15.950 2.910 5.482 8.44e-08 ***
sexmale 389.892 47.848 8.148 7.97e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 82563.33)
Null deviance: 215259666 on 332 degrees of freedom
Residual deviance: 26915647 on 326 degrees of freedom
(11 observations deleted due to missingness)
AIC: 4723.9
Number of Fisher Scoring iterations: 2
family = binomial(link = "logit")We will use the mpcmp6 R package’s implementation of the CMP\(\mu\)-regression
epiR packageImage from R Markdown cookbook chapter 2.1
Create an rmarkdown document
Knit your document into html
The consort package provides a convenient consort_plot function to plot CONSORT diagrams for reporting inclusion/exclusion/allocation (see the vignette)
library(consort)
flow_diagram <- consort_plot(data = penguins %>%
mutate(id = row_number(),
exclusion = case_when(island == "Dream" ~ "Penguins on Dream island",
year == 2007 ~ "Data collected in 2007",
TRUE ~ NA_character_) %>%
fct_relevel("Penguins on Dream island",
"Data collected in 2007")),
orders = c(id = "Study population",
exclusion = "Excluded",
id = "Data for analysis"),
side_box = "exclusion",
cex = 1.2)
plot(flow_diagram)For presenting in html documents, the plot function does not render the plot fully so when that happens, save it as an image first
Next, load your image using knitr::include_graphics
pacman::p_load(arsenal, knitr)
tableby(species ~ sex + island + bill_length_mm,
data = penguins,
control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"))) %>%
summary(text = TRUE) %>%
1knitr::kable(align = c("l", rep("r", 5)))kable function converts the table object to html
| Adelie (N=152) | Chinstrap (N=68) | Gentoo (N=124) | Total (N=344) | p value | |
|---|---|---|---|---|---|
| sex | 0.976 | ||||
| - N-Miss | 6 | 0 | 5 | 11 | |
| - female | 73 (50.0%) | 34 (50.0%) | 58 (48.7%) | 165 (49.5%) | |
| - male | 73 (50.0%) | 34 (50.0%) | 61 (51.3%) | 168 (50.5%) | |
| island | < 0.001 | ||||
| - Biscoe | 44 (28.9%) | 0 (0.0%) | 124 (100.0%) | 168 (48.8%) | |
| - Dream | 56 (36.8%) | 68 (100.0%) | 0 (0.0%) | 124 (36.0%) | |
| - Torgersen | 52 (34.2%) | 0 (0.0%) | 0 (0.0%) | 52 (15.1%) | |
| bill_length_mm | < 0.001 | ||||
| - N-Miss | 1 | 0 | 1 | 2 | |
| - Mean (SD) | 38.791 (2.663) | 48.834 (3.339) | 47.505 (3.082) | 43.922 (5.460) | |
| - Median (Q1, Q3) | 38.800 (36.750, 40.750) | 49.550 (46.350, 51.075) | 47.300 (45.300, 49.550) | 44.450 (39.225, 48.500) | |
| - Range | 32.100 - 46.000 | 40.900 - 58.000 | 40.900 - 59.600 | 32.100 - 59.600 |
The kableExtra package provides useful functions to format tables e.g. changing column width, changing colors, font size, fixed header rows etc.
pacman::p_load(arsenal, knitr, kableExtra)
tableby(species ~ sex + island + bill_length_mm,
data = penguins,
control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"))) %>%
summary(text = TRUE) %>%
kable(align = c("l", rep("r", 5))) %>%
kableExtra::column_spec(column = 6, width_min = "2.5cm")| Adelie (N=152) | Chinstrap (N=68) | Gentoo (N=124) | Total (N=344) | p value | |
|---|---|---|---|---|---|
| sex | 0.976 | ||||
| - N-Miss | 6 | 0 | 5 | 11 | |
| - female | 73 (50.0%) | 34 (50.0%) | 58 (48.7%) | 165 (49.5%) | |
| - male | 73 (50.0%) | 34 (50.0%) | 61 (51.3%) | 168 (50.5%) | |
| island | < 0.001 | ||||
| - Biscoe | 44 (28.9%) | 0 (0.0%) | 124 (100.0%) | 168 (48.8%) | |
| - Dream | 56 (36.8%) | 68 (100.0%) | 0 (0.0%) | 124 (36.0%) | |
| - Torgersen | 52 (34.2%) | 0 (0.0%) | 0 (0.0%) | 52 (15.1%) | |
| bill_length_mm | < 0.001 | ||||
| - N-Miss | 1 | 0 | 1 | 2 | |
| - Mean (SD) | 38.791 (2.663) | 48.834 (3.339) | 47.505 (3.082) | 43.922 (5.460) | |
| - Median (Q1, Q3) | 38.800 (36.750, 40.750) | 49.550 (46.350, 51.075) | 47.300 (45.300, 49.550) | 44.450 (39.225, 48.500) | |
| - Range | 32.100 - 46.000 | 40.900 - 58.000 | 40.900 - 59.600 | 32.100 - 59.600 |